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ABSTRACT 

If an extended source, such as a galaxy, is gravitationally lensed by a massive object in the fore- 
ground, the lensing distorts the observed image. It is straightforward to simulate what the observed 
image would be for a particular lens and source combination. In practice, one observes the lensed 
image on the sky, but blurred by atmospheric and telescopic effects and also contaminated with noise. 
The question that then arises is, given this incomplete data, what combinations of lens mass dis- 
tribution and source surface brightness profile could plausibly have produced this image? This is a 
classic example of an inverse problem, and the method for solving it is given by the framework of 
Bayesian inference. In this paper we demonstrate the application of Bayesian inference to the problem 
of gravitational lens reconstruction, and illustrate the use of Markov Chain Monte Carlo simulations 
which can be used when the analytical calculations become too difficult. Previous methods for per- 
forming gravitational lens inversion are seen in a new light, as special cases of the general approach 
presented in this paper. Thus, we are able to answer, at least in principle, lingering questions about 
the uncertainties in the reconstructed source and lens parameters, taking into account all of the data 
and any prior information we may have. 

Subject headings: gravitational lensing — methods: data analysis — methods: statistical 



1. INTRODUCTION 

The number of known gravitati onal lenses has grow' n 
steadily over recent decades (e.g. iCabanac et al.ll2005j) . 
They hold much promise for being able to constrain 
theories about dark matter (by probing the mass dis- 
tribution of the lensing object), and also for being able 
to magnify distant sources into view, that would have 
been unobservable otherwise. However, it is slightly 
disconcerting that as the number of observed gravita- 
tional lens systems has increased, so has the number 
of seemingly distinct methods for ana lysing them. A 
few e xamples are: Th e Ring Cy c le llKochanek et al.l 

LensMEM 
IWavth et all 



IT989I). LensCLEAN (IWucknitzl 
llWallington. Kochanek. fc^aravanl 



12005V Semi-hnear Inversion ('Warren fc Dvell2003D . and 
Genetic Algorithms (jBrewer fc Lewis 2005j) . 

These techniques all have their own particular advan- 
tages and disadvantages, and it is not clear whether or 
not there is one method that is in some sense more justi- 
fied than the others. In this paper we seek to unify these 
approaches and show that they are all different methods 
of doing essentially the same thing. Of course, practi- 
cal considerations such as computational efficiency must 
also be taken into account in judging the relative mer- 
its of different inversion methods, although this is not 
such a huge issue due to increasing computer power and 
the relative rarity of discoveries of these systems. In this 
paper, we only consider the question of theoretical jus- 
tification of these methods; discussions of the practical 
considerations can be found in the respective literature. 

2. BAYESIAN INFERENCE 

Bayesian Inference is an approach to statistical anal- 
ysis that is based almost entirely on probability theory. 



viewed as a ge neralisation of logic to include uncertainty 
l|Lorednin99,5ll . This was the original approach to prob- 
ability theory as developed by Laplace, but was largely 
rejected in the early 20th century in favour of the "fre- 
quentist" school of thought, where probabilities can only 
be interpreted as long-run frequencies in a random exper- 
iment. However, Bayesian methods enjoyed a resurgence 
in the latter ha lf of t he 20t h century. One reason for this 
is the proofs of lCoxl l)194(iD . who showed that any system 
which measures plausibility with real numbers, and sat- 
isfies several consistency criteria, is equivalent to prob- 
ability theory. More convincing, however, is the rapidly 
growing number of practical examples of its use in real 
applications (e.g . iLewis fc Bridldl200l lEsch et alJl200l 
IWheatlandll20M and many others). 

The product rule of probability is as follows. For two 
propositions A and B, the probability that both A and 
B are true, given some background information and/or 
assumptions /, is given by 



P{AB\I) = P{A\I)P{B\AI) = P{B\I)P{A\BI) (1) 

where P{B\AI) is the probability that B is true, given 
that A is true. Rearranging the above equation gives 
Bayes' theorem: 



P{A\BI) 



P{B\I) 



P{B\AI) 



(2) 
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By considering propositions of the form "The quantity 
X lies in the interval {x,x -\- dxy\ it can be shown that 
a version of Bayes's theorem also holds for probability 
density functions (PDFs) of continuous variables. Thus, 
whenever you want to learn the value of some unknown 
quantities ^, given some other known quantities (the data 
D) that depend on the 9 values in a probabilistic (i.e. not 
entirely predictable) way, then the correct procedure is 
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to calculate the probability distribution of the unknown 
quantity, given the data: 

p{e\Di) (xp{e\i)p{D\ei) (3) 

which is the "posterior" PDF of the parameters given 
the data. The p{D\I) term which should be in the de- 
nominator has been omitted, because it does not depend 
on 9 and can be absorbed into the normalisation con- 
stant. Thus, Bayes' theorem tells us how to modify our 
state of knowledge of the unknown 9 values by taking 
the data D into account. However, the posterior PDF 
depends on p{9\I), the so called "prior distribution" of 
the unknown quantities. In other words, we can't learn 
from the data unless we specify our prior knowledge (or 
ignorance) about the values of those unknown parame- 
ters, before observing the data. 

The second term in equation|31 the probability density 
of the data if we knew the values of the parameters, is 
called the sampling distribution if we consider a fixed set 
of parameters, and imagine calculating the probability 
of obtaining different data sets D. However, for a fixed 
data set (the one that is observed), and considering its 
dependence on 9, it is called the likelihood function. It 
is immediately clear that the "maximum likelihood esti- 
mate" of statistics is just the mode of the posterior dis- 
tribution, for a uniform prior distribution, because the 
posterior PDF is proportio nal to the lik elihood function 
in this case. As shown bv IPressI l|2002|) . this leads to a 
neat justification for the method of least squares, as being 
appropriate when the sampling distribution is Gaussian. 

The presence of the prior term in the Bayes ian ap- 
proac h has been the subject of much controversy l|Loredol 
Il995(l . and is usually the reason why some people avoid 
it. The point of the prior distribution is not to describe 
randomness or variation of the quantity, but our state of 
knowledge about it. However, the results of a calculation 
are generally insensitive to the choice of prior, particu- 
larly when it is diffuse compared to the likelihood func- 
tion. In addition, conventional statistics, while appear- 
ing more "objective" due to the lack of things like prior 
distributions, has merely swept the matter under the car- 
pet. Use of statistical methods that don't make reference 
to prior distributions can usually be shown to be equiva- 
lent to a specific choice of prior; the maximum likelihood 
estimate mentioned above is one example. There also 
exist a limited number (at this time) of general princi- 
ples for the objective assignment of prior distributions, 
given particular types of prior information ( Javncs 2003). 
A simple example is the principle of insufficient reason, 
given by Laplace; if we only know that there are N pos- 
sibilities that are mutually exclusive and exhaustive, and 
nothing more, we assign equal prior probabilities of 1/N 
to each. 

3. GRAVITATIONAL LENSING 

One of the main observable predictions of the theory 
of general relativity is that light rays can be bent as 
they pass through the gravitational field of a massive 
object. Usually, we assume that all of the mass in the 
lensing object is located in a plane, called the lens plane. 
If the coordinates on the lens plane are x and y, the 
mass distribution of the lensing object is described by 
a function p(x,y). A light ray passing through the lens 



plane will be bent by an angle called the deflection angle 
a{x,y). For the case of a point mass lens, the deflection 
angle is directed towards the mass, and has magnitude 
proportional to l/r, where r is the distance from the 
point mass. For the case of a continuous mass density, 
the deflection angle at a point (x, y) in the lens plane 
is obtained by integrating the deflection angles due to 
all other mass elements. In the usual scaled coordinates, 
any light that is observed at position {x,y) in the image 
plane had its origin at the position (x^, ys) in the source 
plane, given by 

Xs=x-axix,y) (4) 

ys = y-ay{x,y) (5) 

Note that equation 0] gives a unique source plane po- 
sition {xs,ys) for a given image plane position {x,y). In 
general, though, the inverse function does not exist, so 
any particular position {xs,ys) in the source plane can 
be mapped to multiple positions in the lens plane. This 
is the mathematical reason why the background source is 
often multiply imaged in gravitational lens systems. For 
an extended source, the source may be described by a 
surface brightness profile S{xs,ys)- The observed image 
is usually considered as a function over the lens plane, 
I{x, y). Since surface brightness (intensity per unit area) 
is not changed by lensing, we have that the observed im- 
age is given by 

I{x,y) = S{xs{x,y),ys{x,y)) = S(x-a^{x,y),y-ay{x,y)) 

(6) 

When the source and image are pixellated, this relation 
does not hold exactly because the image consists of the 
integrated surface brightness over the pixels. However, 
there is a linear relationship between the source and the 
image in this case, which is given in section 15.31 

4. METHOD 

In this section we present the general procedure of the 
calculations that need to be done, but the equations are 
left in a generic form, appropriate for whatever parame- 
terisation of lens and source is chosen. Later on, we will 
specialise to the case of a pixellated source and a param- 
eterised lens model. In the context of lensing, we have 
an underlying source S and a lens L, which we wish to 
infer from observations, which are the observed image O. 
In other words, we want the joint posterior distribution 
of the S and L parameters (whatever quantities we are 
using in our mathematical description of the source and 
lens), given O. Bayes' theorem states 

p{LS\OI) cx p{LS\I)piO\LSI) (7) 

where the factor p{0\LSI) is determined by the model 
of the noise. If the lens parameters are of interest, but the 
source is not, then the inference about the lens is given 
by the marginal probability density of the L parameters: 

p{L\OI) (X J p{LS\I)p{0\LSI)dS (8) 

In general, any prior knowledge of the source will not 
affect our prior knowledge about the lens, and vice versa. 



3 



Therefore the joint prior PDF for L and S can be fac- 
torised into independent priors for L and S*, giving 

p{L\OI) (X p{L\I) j p{S\I)p{0\LSI)dS (9) 

In some special cases, it may be possible to evaluate 
these integrals analytically. This is not the case in the 
following discussion, however there are numerical meth- 
ods which can be used. 

5. PIXELLATED SOURCES 

When all is said and done, we would like to describe 
our knowledge of the source and the lens by a finite 
number of values, and the uncertainties in those val- 
ues. In this paper, we are concerned with the case of 
extended sources, which can have detailed structure in 
them. Thus, we need to choose a parametrisation that 
allows for the possibility of arbitrarily complicated struc- 
tures in the source. Pixels are the obvious choice. 

5.1. What are we doing? 

Suppose we are given an observed image. For any lens 
model we choose, it is possible to find a source light dis- 
tribution that reproduces the observed image. In the 
case that two points with different brightness in the im- 
age map to almost the same point in the source plane, we 
could get around this by having rapidly varying structure 
in the source, with extremely high positive and negative 
values of surface brightness. However, nobody would se- 
riously propose such a reconstruction; we already have 
prior knowledge that rules out this possibility. We ex- 
pect in advance that the source should be non-negative 
and smooth on some scale. There are several ways of 
taking this into account, using pixels is one way, because 
structure within pixels is impossible. Other possibilities 
include using non-square basis functions (for example, 
reconstructing the source as a combination of Gaussian 
blobs), use of "fuzzy pixels", or by choosing a prior dis- 
tribution that rewards smooth sources. 

In using pixels to describe the source light profile, we 
are restricting our resolution to scales larger than the size 
of the pixels. It is often clear how to judge in advance the 
pixel size that is necessary. With gravitational lensing, 
however, intuition suggests that our image will allow us 
to reconstruct the source with better resolution in some 
regions of the source plane than in others. For example, 
we might expect that we could achieve higher resolution 
in the reconstructed source for areas in the source plane 
that are near the caustics - these regions have been im- 
aged more times (ie they affect more pixels of the image) 
than the o thers, so our image pr ovides more constraints 
on them. iDve fc WarrenI 1)20051) have devised a clever 
method of adaptive pixellation of the source plane for 
use with their semi-linear inversion method, that puts 
more pixels on high magnification regions. 

In this paper we use a uniform pixel size, and since 
we calculate the posterior distribution for the brightness 
of each pixel, will be able to see directly which pixels 
are strongly constrained by the observations, and which 
ones aren't. Hence we can check if the high magnifica- 
tion regions of the source plane really are more tightly 
constrained by the data. 

In reality, sources are not made up of pixels, ie square 
regions of constant surface brightness. We are obliged to 



use them because we must have some way of describing 
the source by a finite set of numbers. However, observed 
images taken with CCDs do actually consist of pixels. 
The intensity in each pixel corresponds to an estimate 
of the integral of the true surface brightness ( "intensity 
density" ) of the image, over the area covered by the pixel. 
Hence, the object we wish to infer (the source) is a sur- 
face brightness function S(xs-, Vs), yet our observed data 
consists only of integrals of that function over certain 
regions. Thus, our data can only ever constrain those in- 
tegrals, and prior information is the only way of deciding 
between different reconstructions that all fit the data. 

Bayesian methods give us the posterior PDF of all of 
the pixel values, given the data. If we reconstruct with 
a large number of pixels, then any individual pixel is 
likely to be constrained very weakly by the data, so the 
marginal posterior PDF of any individual pixel will be 
very wide. However, a question we would like to answer 
is "if the lensing object was not there, and we observed 
the source, what would we see?" . Since observed images 
are pixellated, in effect we want to estimate the integral 
of the surface brightness over some square regions, which 
may be larger than the pixels we did the reconstruction 
with. We will see that, while the individual pixels are 
only weakly constrained by the data, sums of them over 
particular regions (which approximate the integral of the 
surface brightness over those regions) can be quite pre- 
cisely determined. 

A strong analogy can be drawn between this reason- 
ing and that used in statistical mechanics. In statistical 
mechanics, we start off completely ignorant of the mi- 
croscopic state of the system. Then, measurement of a 
macroscopic thermodynamic variable narrows down the 
range of possible "microstates" that the system could be 
in, but there are still a huge number that are compatible 
with the macroscopic constraints imposed by the data 
[the "ensembles" of statistical mechanics are somewhat 
like posterior PDFs of th e micro s copic variables, given 
the macroscopic data, see I.Tavnesllll957l) ]. When we use 
this to make predictions about other macroscopic quan- 
tities, though, we find that there is enough information 
to make accurate predictions of these. 

5.2. Underdetermined Problems 

It is well known that, when attempting to infer a large 
number of parameters from some noisy data, the large 
number of parameters makes it possible to "overfit" the 
data by fitting the noise. With conventional strategies, 
the best possible fit within the parameter space is ob- 
tained, and is usually accepted or rejected based on the 
criterion (Press 2002). This penalises the use of large 
numbers of parameters, and behaves like a kind of "Oc- 
cam's Razor" . From a Bayesian perspective, however, it 
is clear that is not the final answer to the question 
of goodness of fit. It is possible that a correct model 
could give a best fit which is not good enough by the 
criterion, yet the fit is still good over a wide range 
of parameter space. To reject a model altogether, it is 
not the value of of the best fitting parameters that is 
important (although it may be a useful indicator), but 
the "evidence" value P(0|/), which is required in order 
to test one model against a particular alternative model 
I2, as can be seen by writing down Bayes' theorem for 
the posterior probabilities for two competing models Ii 
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and I2 ■ 

The answer we get from Bayesian analysis is the joint 
posterior PDF of all of the parameters, indicating what 
values are plausible in the light of the data. Even though 
the best fitting set of parameters may dramatically overfit 
the data, the volume of parameter space that is near 
this peak is exceedingly small, and decreases with the 
number of parameters. Thus, the overwhelming majority 
of plausible reconstructions do not "fit the noise" , even 
if wc use a larg e number of parameters. 

W arren fc Dve (2003) noted that the use of "regulari- 
sation" made the criterion inapplicable, by effectively 
reducing the number of free paramaters, by tying some 
together or limiting their allowed ranges. Since the use 
of regularisation is equivalent to a particular choice of 
prior distribution (see section IK^ . it is possible to com- 
pare different reconstructions objectively. The theory 
of B ayesian model selection l|Hobson. Bridle, fc Lahavl 
120021^ provides the means for accomplishing this, and 
automatically displays phenomena such as Occam's Ra- 
zor, and penalising models that require fine tuning (one 
model may fit the data well, but only in a small region 
of its parameter space, while another model which fits 
the data over a wide range of its parameter space will be 
preferred). Unfortunately, this usually involves integrals 
of complicated functions over many dimensions, so is of- 
ten impractical. With this problem, is not strictly 
applicable, but the alternative is difficult, and is beyond 
the scope of this paper. It seems as if this problem can 
become arbitrarily complicated. This is true of every 
problem in science. Any phenomenon can be modelled 
either crudely or in great detail, and it is only the limits 
in the abilities and attention spans of the people studying 
them that determines when they decide to stop. 

5.3. Gravitational Lensing of Pixellated Sources 

When both the image and the source are pixellated, 
they can be regarded as vectors I and s respectively, 
where each component of the vector is the value of 
a pixel. In general, these vectors will be of different 
lengths. Due to conservation of surface brightness, if the 
source was multiplied by a constant or had a constant 
added to it, the effect of this on the image, and hence 
the image pixel values, would be the addition or multi- 
plication of the same constant. In other words, lensing 
is a linear operator, so for the case of pixellated images 
and sources, there must be a linear relation between I 
and s. ie a matrix L exists such that 

I = Ls (10) 

The form of the matrix L is completely determined 
by the lens mass distribution p( x,y), and can be calcu- 
lated by ray tracing ifWallingto n. Naravan. fc Kochanei3 
^^). The blurring by the point spread function is also 
a linear operation, so can be represented by another ma- 
trix multiplication, alternatively, we regard L as being 
the matrix product of a lensing matrix and a blurring 
matrix. The observed image is vector O which is related 
to the true source s by 

= Ls + N (11) 



where L is an unknown lensing/blurring matrix and N 
is a vector of "random noise", the observational errors 
in each pixel of the observed image. Thus, this prob- 
lem is of a standard type jPress 200^ as the structure 
of the problem in equation 15.31 arises in many different 
contexts, and is not limited to CCD pixel values. Here, 
the matrix L is unknown, but we will describe the lens 
by a parametrised model, so the number of degrees of 
freedom of the matrix L is not too large. The point 
spread function is assumed to be known for our simula- 
tions; this is at least approximately true in practice, as 
the point spread function can be measured or simulated 
from the instrumental response to a point source. Due to 
the presence of the blurr ing, we are effecti vely also doing 
Bayesian deconvolution (jEsch et al.ll2004|) . 

5.4. Prior Distributions, Relation to Other Methods 

The choice of prior distributions often gets little at- 
tention. However, it is usually the case that neglecting 
this question by using some other method is equivalent 
to using a particular prior distribution. For example, it 
is quit e easy to show that the method of iWarren fc Dvd 
l)2003j) . being a least squares method, is equivalent to 
finding the peak of the posterior PDF p{s\OLI) for a 
uniform prior distribution ^(5*1/) — constant, and Gaus- 
sian noise. This has the obvious drawback that it allows 
pixels to be negative, and hence ignores the most basic 
prior information that we know of - surface brightness is 
a nonnegative quantity. However, this may be a small 
price to pay for the convenience it delivers. 

A simple way to rectify this is to consider the problem 
as one of constrained least squares, where the devia- 
tion between the model and the data is minimised subject 
to a positivity constraint o n all of the source pixels val- 
ues llB rewer fc Lewi s"2005') . Least squares is equivalent 
to the Bayesian approach with a uniform prior distribu- 
tion, but with negative values forbidden. There are some 
problems associated with this method that are discussed 
in section 17.11 These problems were not noticed previ- 
ously because the methods focused on optimisation of the 
source (i.e. finding the best fit), rather than exploration, 
where the set of all plausible fits is considered. 

When "re gularisation" is used ijWarren fc Dvd 120031: 
iPressI 12002(1 . whereby an additional term representing 
"quality" is added to the merit function to be opti- 
mised, this is equivalent to assuming a particular form 
of prior distribution, and then seeking the most prob- 
able reconstruction (for a given lens model) by max- 
imising p(s\OLI). T he Maximum Entropy Method 
l|Kochanek et al.lll989() is similar in nature, and has the 
added benefit that positivity of the source is guaranteed. 
These approaches tend to give results that are useful and 
visually appealing, but some issues arise when these are 
applied to astronomical images. 

Astronomical images are mostly blank. This is a simple 
and important fact that none of the above priors take into 
account, but which is important. For example, use of the 
prior that is constant everywhere where the source pixels 
are positive, while seeming very conservative about the 
values of individual pixels, can end up making dogmatic 
statements about the sum of many pixels. Use of regu- 
larisation or the cntropic prior of LensMEM actually ex- 
acerbate this problem, by making reconstructions which 
are compressed (have more moderate values and less ex- 
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treme ones) more likely a priori. In later sections with 
simulated data, we will use a prior that, while not hav- 
ing as sophisticated a justification as the entropic prior, 
is closer to what we actually think about astronomical 
data. 

We will make the conventional choice of a Gaussian dis- 
tribution for the noise, leading to the following sampling 
distribution/likelihood function p{0\sLI), 

p(0|si/)oc[|[e 'V J]=e-5x' (12) 

i=l 

where the ai are the "uncertainties" of each pixel of the 
observed image. Before applying any lensing reconstruc- 
tion procedures to the data, it has usually been subjected 
to some other procedures, such as foreground galaxy sub- 
traction and also the subtraction of the mean sky bright- 
ness. Thus, it is unreasonable to expect to be able to 
use a Poisson sampling distribution (from photon count- 
ing arguments), because after these necessary procedures 
been performed, the data are no longer integer valued, 
and can be negative. 

As long as the cr^ correctly indicate the expected (rms) 
magnitude of the noise, then the independent Gaussian 
PDF is the most conservative and general probability as- 
signment we could make. It does not necessarily have to 
correctly represent the long run frequency distribution 
of the errors in repeat ed exper i ments (if such a thing ex- 
ists), as discu ssed bv IJaynesI 120031) and demonstrated 
empirically bv iBretthorsti 1)1998^. As long as the Gaus- 
sian model does not assign extremely low probability to 
the actual errors that are present in the one observed 
image that we actually have, then no problems will arise 
from its use. The worst case scenario is that parameter 
estimates will be slightly more conservative than they 
would have been if we had additional information about 
the noise. 

5.5. Prior Information About the Lens 

In order to set up a prior distribution p{L\I) for the 
parameters of the lens model, we must consider what 
little information we have about the lens mass distri- 
bution, without considering the image. However, we are 
allowed to consider the image of the foreground galaxy as 
information which is relevant to the lens model param- 
eters. In the dark matter paradigm, there is typically 
a poor correlation between the observed light distribu- 
tion of the foregrou nd lensing galaxy and the way it s 
mass is distributed ijRusin. Kochanek. fc Keetonll2003j) . 
In this case, the prior distributions for the lens param- 
eters would be very diffuse, with the galaxy image only 
providing information about the general order of mag- 
nitude of the lens parameters. However, if we assume 
some particular relativistic theory of MOND [such as 
IBekensteirJ ()2004 lh] is true, and its implications for lensing 
are known, then the observed galaxy image will provide 
quite strong constraints on the parameters of the lensing 
model. Thus, lensing can provide a test of dark mat- 
ter vs MOND, because if MOND is used, the brightness 
profile of the lensing galaxy should all but fix the lens 
model. In the coming simulations, uniform priors are 
used for all lens parameters. This has little effect on the 
conclusions because the data are able to constrain the 
lens parameters quite strongly. 



6. MARKOV CHAIN MONTE CARLO 

It is usually straightforward to write down the poste- 
rior PDF for all of the unknown parameters (equations[3 
together with IT^ . However, written in this form it is 
often not of much use. We are usually interested in some 
of the parameters, but not all of the others. For exam- 
ple, we might want to know the posterior PDF for the 
total mass in the lens, so we can quote an estimate and 
its uncertainty. In principle, the way to deal with this is 
to obtain the marginal PDF for all of the wanted param- 
eters, by integrating the joint PDF with respect to all of 
the unwanted parameters. However, in practice this is 
usually impossible to do analytically. 

Markov Chain Monte Carlo (MCMC) methods are a 
computational tool for solving this problem. They seek 
to generate a random sam ple from the po sterior PDF 
[a cosmological example is iLewis fc Bridl3 ((2002)]. By 
looking at only one parameter of the models in the sam- 
ple (for example as a histogram), we can immediately get 
an idea of the marginal posterior PDF for that param- 
eter. We can also approximate its mean and standard 
deviation (which we might give as the estimate and the 
uncertainty) by using the mean and standard deviation 
of the sample. The Markov Chain aspect comes from the 
way these methods work. Most work by taking a kind 
of random walk through the parameter space, with the 
transition probabilities constructed in such a way that 
the stationary distribution of the Markov Chain is the 
posterior distribution that we are interested in. Then, in 
the long run, the random walk visits different regions of 
the parame ter space in proportion with their posterior 
probabihty l|Gilks et al. 1995). 

6.1. The Metropolis Algorithm 

The MetropoHs algorithm ijMetroDoHs et alJ 1195,31) is 

the starting point for many Markov Chain Monte Carlo 
schemes. Start from a point x in the parameter space. 
Denote the target distribution (which we want to sam- 
ple from, in our case it is the posterior distribution 
p(SL|0/)) by 7r(x). A proposal step is made, from the 
point X to the point y according to the proposal distribu- 
tion (/(yjx), which is assumed to be symmetric, meaning 
that g(y|x) = (/(xjy). This proposal step is always ac- 
cepted if 7r(y) > 7r(x), otherwise it is accepted with prob- 
ability P — ^1^- This procedure is repeated many times, 
resulting in a random walk through the parameter space 
whose long term frequency distribution is equal to the 
target distribution. The performance of this algorithm 
depends on the form of the proposal distribution (the 
method should always work, but some proposal distri- 
butions may explore the parameter space more quickly). 
Typically, a normal distribution centred at the current 
value is used, and the variance is adjusted to achieve a 
moderate acceptance rate of 20-50 per cent. 

The success (or otherwise) of Markov Chain Monte 
Carlo methods is highly dependent on the starting point 
of the random walk. If the starting point is somewhere of 
extremely low posterior probability, the chain can wan- 
der around and get stuck in local minima, possibly never 
reaching the correct region. In principle, it eventually 
will, for instance it might spend 10^ years in one local 
maximum before leaping by chance into the high proba- 
bility region, where it would spend 10^*^ years, say. Un- 
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fortunately, there are no general methods for discovering 
whether this has occurred. MCMC can also struggle to 
cope with a situation where good fit is achievable within 
two parts of the parameter space that have very different 
parameter values. This is a problem shared by virtually 
all other methods, some more so than others. If multiple 
maxima are expected to occur then simulated annealing 
or genetic algorithms may be helpful in finding the max- 
ima. Once the maxima are found using any method, they 
can be used as a starting point for the MCMC method for 
evaluating the uncertainties. For the lens modelling, we 
specified an initial point where the lens parameters were 
close to the optimal values, and the source was blank. 

7. APPLICATION TO SIMULATED DATA 

We tested this algorithm on some simulated data. The 
source profile was two Gaussians, and these were lensed 
to create an image on a 64x64 pixel grid. The image 
was created by ray tracing, with the source evaluated 
analytically. Hence we can see the effect that assuming 
pixels has on the final conclusions, when the source isn't 
really made from pixels. This image was then convolved 
with a point spread function, and random noise of known 
standard deviation was also added. These are shown in 
Figure^ The units were chosen so that the image covers 
a region 2.4 Einstein Radii across. The lens model that 
was used was a Ps eudo Isothermal Elliptical Potential 
(PIEP) lens model (|B rewer fc Lewisir2005') with parame- 
ters b = 0.5, e = 0.25, Tc = 0.1, centred at the origin and 
with the principle axes of the ellipse aligned with the x 
and y axes. The b parameter describes the strength of 
the lens, e is the ellipticity, and Tc is the core radius. 

7.1. The Positive Uniform Prior 

At first, a reconstruction was attempted, using the 
prior distribution which is a constant as long as all source 
pixels are positive. The proposal distribution for mov- 
ing from one step of the chain to the next was done by 
adding a normally distributed random number to each 
pixel, and then taking the absolute value. To achieve a 
moderate acceptance rate in the Metropolis method, the 
standard deviation of the proposal distribution was set 
to 5 per cent of the noise level in the image, in the case 
where all pixels are updated at once. For this run, a 
24x24 pixel grid was used, and the lens parameters were 
held constant at the known true values; this suffices to 
illustrate the point. 

A random source sampled from the posterior distribu- 
tion is shown in Figure[21 The corresponding image, and 
the residuals (difference between the reconstructed and 
observed image) are also displayed. The deviation be- 
tween the model and the observed image is 4536, which 
is much higher than the expected value of 64^ = 4096 
if this model was correct. This is because the recon- 
structed source is much too bright in just those areas 
where it should be dark. 

This is a fatal flaw of this prior distribution - when 
it is used, the majority of "plausible" reconstructions 
don't actually match the data! There are several rea- 
sons why this effect was not noticed in previous studies. 
Firstly, they focused on optimisation rather than explo- 
ration. The source that maximises the posterior PDF 
would surely fit the data, but the volume of parameter 
space around this source is very small due to the pos- 
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Fig. 1. — Simulated data for testing our method. The source was 
a double Gaussian, which was lensed via a ray tracing method. The 
sizes of the regions of the source plane and image plane that are 
covered by these pictures are the same throughout the paper. The 
caustics and critical lines are shown on the source plane and the 
image plane respectively. Throughout the paper, we consider only 
small changes in the lens parameters, so the caustic and critical 
line geometries do not change significantly. 
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Lensed and Blurred Image 





Residuals. % = 4536 




Fig. 2. — A random sample source from the posterior PDF, using 
the constant prior. Note that the reconstruction is too bright in 
extended areas, particularly where the source is actually dark, and 
not multiply imaged (the outer regions). 



itivity cutoffs. Also, some authors have chosen to put 
a "mask" around the image, so that the blank regions 
are de emed irrelevant an d not included in the calcula- 
tions (Wavth et al. 2005). Thus, they only looked at the 
residuals in the bright regions of the image, which is the 
region that isn't affected by this problem. By masking 
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out blank parts of the image, possibly important infor- 
mation is also being wasted. For example, suppose a 
model was found that reproduces the bright parts of the 
image faithfully, but also predicts that there should be 
bright regions where in fact darkness has been observed. 
This model would be acceptable if the mask was used, 
but if we always use the whole image, this situation can 
never arise. 

When the lens parameters were also allowed to vary, 
the estimates obtained were also slightly incorrect, with 
the true values lying several standard deviations away 
from the estimates. Thus, using a positive constant prior 
(or something that turns out to be equivalent to it, such 
as constrained least squares) in this situation can bias 
the lens model results slightly. The biasing effect of the 
uniform prior increases with the number of pixels that are 
used, since for more pixels, the implied prior estimate on 
the integrated flux over any region becomes higher. 

7.2. Two Reconstructions 

The idea behind the prior distributions that were used 
in the remainder of the runs was as follows. Each 
pixel has an independent distribution, with some positive 
probability of being very close to zero. This accounts for 
the "astronomical images have a lot of blank regions" ob- 
servation noted earlier. If a pixel is not blank, then we at 
least know the order of magnitude of the surface bright- 
ness distribution, so the positive part of the prior distri- 
bution was taken as an exponential with a mean value 
of 100 (the typical brightness scale of the bright parts of 
the image). The prior distributions will be written down 
explicitly in later sections. A more sophisticated analysis 
leads to a slightly di fferent prior w hich has been called 
"Massive Inference" l|Skillinelll998j) . which is a generali- 
sation of maximum entropy, that has similar features to 
our priors. In the following sections we show how pri- 
ors based on these seemingly trivial observations lead to 
more sensible reconstructions. 

In all of the following simulations, the lens parameters 
were also varied. Since this is a more computationally in- 
tensive task, involving ray tracing, it was only done once 
every 10th step. Unfortunately, to achieve a moderate 
acceptance rate, the proposal distribution for the lens 
model had to be quite narrow. We used a normal dis- 
tribution centred at the current lens parameters, with a 
standard deviation of 0.001. For the first 50,000 steps of 
the chain, only the source was varied, so that the recon- 
struction is reasonable before the lens parameters start 
changing. 

7.2.1. Not Many Pixels 

In this section we present two different reconstructions 
of the same simulated data, which both lead to sensible 
results. At first, we reconstructed with a 16x16 pixel 
model for the source, covering a square region from - 
0.5 to 0.5 Einstein radii in the source plane. The prior 
probability density on each pixel was taken as a mixture 
of two exponentials, with 50 per cent weighting given to 
each: 

p{s,\h) = ^exp{-sj2)/2 + iexp(~s,/100)/100 (13) 

The first exponential gives a 50 per cent prior prob- 
ability of the pixel being quite dark, while the other 
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Fig. 3. — A random sample source from the posterior PDF, 
using the mixture of exponentials prior. The residuals appear as 
a noise map, without the overly bright regions that occurred with 
the uniform prior. 



exponential allows for the possibility that the pixel is 
quite bright. The MCMC approach that was used for 
this reconstruction was a slightly modified version of the 
Metropolis algorithm. The prior distribution was actu- 
ally incorporated within the proposal step, rather than 
in calculating the ratio of the posterior probabilities. In 
other words, the proposal step was created such that it 
would have sampled from the prior distribution, in the 
absence of any data. Then the acceptance/rejection de- 
cision was based on just the ratio of the likelihoods, as 
the prior had already been taken into account. Figure |21 
shows a sample reconstruction from the posterior distri- 
bution, both the source and the resulting blurred image. 
The problem of overly bright areas in the reconstruction 
has disappeared, and the residuals look just like a uni- 
form noise map, as they should. 

The mean and standard deviation of the entire sam- 
ple are also shown (Figure 01) , as is the posterior PDF 
of the total amount of light in the source. As described 
in the introduction, the marginal posterior PDF of each 
pixel is quite wide, but the inference about a "macro" 
quantity (the total integrated intensity) is fairly strongly 
constrained at 9.75 ± 0.21 (posterior mean ± standard 
deviation, so a "1-sigma" uncertainty). The reason for 
displaying the mean of all of the reconstructions is be- 
cause expectation values satisfy a linear property, so if 
we want to predict any integral property of the source 
(for instance, the sum of all pixels within a certain re- 
gion), the mean of the sum is equal to the sum of the 
means. 

The trajectory of the lens model parameters, and the 
corresponding histograms, are shown in Figure |5l Only 
every 100,000th step of the chain was saved, so that there 
are no strong correlations between one point and the 
next. We can see from the histograms what the inference 
would be about the three "interesting" lens parameters, 
6, e and Vc- The position of the centre of the lens, and the 
angle of its orientation, were also varied and the marginal 
PDFs of those parameters can also be viewed, but since 
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Fig. 4. — Reconstructions with the 16x16 pixel grid, and the 
prior distribution which allowed a significant chance for each pixel 
to be dark. The uncertainty of each pixel (taken as the standard 
deviation of the marginal posterior distribution) is largest for those 
pixels that are bright, and is smallest for the pixels that are highly 
magnified. This can be seen from the presence of the diamond 
shaped dark region in the uncertainty map - this is precisely the 
high magnification region of the favoured lens model. 



Fig. 6. — A random sample source from the posterior PDF, using 
the 64x64 pixel grid for the source, and the 90 per cent dark prior. 
The residuals appear like a noise map, as expected. 
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Fig. 5. — Lens model parameters with the 16x16 pixel grid. With 
this particular pixellation and prior distribution for the source, the 
lens parameters are quite accurately determined (the true values 
are 0.5, 0.25 and 0.1 respectively). 



these are uninteresting nuisance parameters, we have not 
displayed them here. 

With a prior distribution on each pixel that allows a 
significant probability of the pixel being very dark, the 
inference about the lens parameters is more accurate 
than with the uniform positive prior. The measured val- 
ues of b, e and would be 0.4993±0.0022, 0.2521±0.005 



and 0.0975 ± 0.0053, respectively, which are consistent 
with the known true values of 0.5, 0.25 and 0.1. A simi- 
lar consistency is also seen with the position of the centre 
of the lens, and its orientation angle, for which the true 
values were zero. 

7.2.2. Many Pixels 

An alternative pixellated model was also attempted, 
with 64x64 pixels. With this reconstruction, we used a 
more informative prior, this time with a prior probability 
of 90 per cent that the pixel is exactly zero: 

p{s,\h) - ^5{s,) + lexp(-5,/100)/100 (14) 

To incorporate this in the Metropolis method, we pro- 
ceeded as before by defining the proposal step so that it 
would have sampled from the prior distribution if there 
wasn't any data. In each proposal step, one pixel was 
selected at random, and switched off with probability 
0.9, or on with probablility 0.1, in which case its surface 
brightness was taken from an exponential distribution 
with a mean of 100. The accept/reject decision was then 
based on the likelihood ratio of the proposed step to the 
current one. 

Usually, if a proper model selection analysis is done 
(by calculating P(/2|0)/P(/i|0)), the 16x16 reconstruc- 
tion would be favoured because it has managed to ex- 
plain the data adequately with fewer parameters. Within 
the Bayesian framework, however, the more informative 
prior distribution in the 64x64 case counteracts this ten- 
dency to some extent (in principle this can be calculated, 
but it is beyond the scope of this paper). This is because 
the posterior probability of a model depends not only on 
the number of parameters, but also on the amount of 
prior information that there is about the values of the 
parameters. A model is favoured if it predicts the ob- 
served data (by assigning high probability to it) - but if 
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Fig. 7. — Reconstructions with the 64x64 pixel grid, and the 
prior distribution which allowed a 90 per cent prior probability of 
a zero in any pixel. The magnification pattern is again present in 
the uncertainty maps - high magnification regions have low uncer- 
tainties, as expected. 
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Fig. 8. — Lens model parameters with the 64x64 pixel grid. Due 
to the increased flexibility that the source is allowed to have with 
this parameterisation, the lens model parameters are less strongly 
constrained, with the uncertainties increased by a factor of ^ 3. 
However, the measurements are still accurate enough for most pur- 
poses. 



we do not know much about the values of parameters, 
then we know less about what the model actually pre- 
dicts. Hence it must spread its share of probability over 



a larger range of possible data sets, and therefore assigns 
lower probability to the data set that was actually ob- 
served. 

From the error maps in Figure[3 it is clear that the un- 
certainty as to the value of a pixel is far from uniform over 
the source plane. The standard deviation image shows 
that the uncertainty is greatest in the bright regions. 
There is also a slight effect due to the magnification pat- 
tern, as the diamond-shaped caustic shows up weakly 
as a region of lower uncertainty. However, the fractional 
uncertainty (standard deviation divided by the mean) ac- 
tually follows the opposite trend. The bright values have 
low percentage error, and within the high magnification 
regions, the percentage error is highest. This suggests 
that the answer as to which regions of the source plane 
are more constrained is not so clear cut. 



8. CONCLUSIONS 

Various methods have been proposed for analysing ex- 
tended gravitational lens systems, in order to extract 
as much information as possible about the lens and the 
source. Most have been found to be useful, but all have 
limitations. One feature common to all previous methods 
is the lack of a simple way to evaluate the uncertainties 
in all of the inferred quantities. By using Bayesian infer- 
ence, it is possible to reinterpret these methods, showing 
that they are all equivalent to Bayesian inference with 
different priors. This suggests how methods may be im- 
proved by using more informative priors, and also how 
uncertainties should be quantified by exploring, rather 
than maximising, the resulting probability distribution. 
In this paper, we used a simple Markov Chain Monte 
Carlo method to explore the parameter space and sam- 
ple from the posterior distribution, leading to quantita- 
tive uncertainties on all of the lens model parameters, 
and also the source pixel brightnesses. Simulated data 
was used, but we hope to apply this method to actual 
data in the near future. 

When modeUing the Einstein rin g ER 0047-2808 us- 
ing the Maximum Entropy Method. Wavth et al] l)2005fl 
showed that is usually possible to find a plausible recon- 
struction of an image with several different lens models, 
when only the best fit parameters are considered. Thus, 
the lens models remain on an even footing and we can- 
not distinguish which one is more realistic. Luckily, many 
conclusions about the source are unchanged by the use 
of different lens models. However, it should be possible 
to do Bayesian model selection analysis which may show 
that one of the models is preferred because it fits the 
data over a larger volume of its parameter space. This 
possibility will be explored in future contributions, along 
with an application of these ideas to ER 0047-2808 and 
other extended gravitational lens systems. 
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